Skip to content

fix(kg): deterministic raw-source provenance (NUL strip, flag resolution, embedding/Phase-4b race)#197

Draft
Number531 wants to merge 2 commits into
mainfrom
fix/kg-raw-source-provenance
Draft

fix(kg): deterministic raw-source provenance (NUL strip, flag resolution, embedding/Phase-4b race)#197
Number531 wants to merge 2 commits into
mainfrom
fix/kg-raw-source-provenance

Conversation

@Number531

Copy link
Copy Markdown
Owner

What this accomplishes

Makes the KG raw-source provenance pipeline deterministic and self-healing, so the GET /api/db/sessions/:sessionKey/kg/raw-sources/:nodeId route (and the frontend Evidence Trail / Findings deep-links that depend on it) reliably resolves to real source attribution instead of silently returning nothing.

Two commits:

  1. fix(kg): deterministic raw-source provenance — NUL strip, flag resolution, embedding/Phase-4b race
  2. fix(kg): review follow-ups — embedding-service availability before phase4b repair

Why it was performed

The frontend team reported kg/raw-sources returning {"sources":[]} for the Cardinal reference session. Investigation found three independent latent defects that, combined, meant explicit kg_provenance.source_hash links were never produced for any session in the DB:

  1. Ordering race (load-bearing): the post-session KG build (Phase 4b, which links nodes → raw-source chunks) fires on a 5s setTimeout whose delay is for report writes, not embeddings. The raw-source embedding queue is fire-and-forget. So Phase 4b consistently ran against an empty source_chunk_embeddings and wrote zero provenance.
  2. RAW_SOURCE_EMBEDDING inert at runtime: the flag is true in flags.env but the dev entrypoint loaded only .env, and featureFlags is a hoisted static import — so the flag resolved false and the embedding dispatcher ran as a no-op stub.
  3. NUL-byte silent data loss: the embedding INSERT had no NUL stripping; 14/367 Cardinal sources failed with invalid byte sequence for encoding "UTF8": 0x00, swallowed by a fire-and-forget .catch().

End objective

Future live sessions self-populate source_chunk_embeddings and explicit kg_provenance.source_hash with no manual backfill, and the route uses the preferred deterministic path. A reconciliation safety-net repairs any session that still raced.

Changes (file-level)

  • src/utils/rawSource/SourceChunker.js — strip NUL (\0) from chunk content/header (single chokepoint for all raw-source chunks; mirrors the existing kgPhase4cNodeEmbeddings.js precedent). + test/sdk/source-chunker-nul.test.js.
  • src/server/loadFlagsEnv.js (new) + claude-sdk-server.js — first-imported side-effect loader so flags.env lands in process.env before the hoisted featureFlags import evaluates (fixes the dev hoisting gap); [Startup] Feature flags banner; broadened embedding-service init gate to also fire when RAW_SOURCE_EMBEDDING is on.
  • src/config/featureFlags.jsvalidateFlags() dependency-validation (warns when RAW_SOURCE_EMBEDDING=true but ARCHIVE/HOOK_DB/(GEMINI||GOOGLE)_API_KEY missing, and on the archive-on/embedding-off divergence). + test/sdk/validate-flags.test.js. Note: this branch's featureFlags.js was rebased onto current main — it preserves main's OPUS_MODEL='claude-opus-4-8' and all wrapped-subagent helpers; only validateFlags() is added.
  • src/utils/hookDBBridge.jssource_chunk_embeddings stabilization poll before the KG build (count>0 AND stable across two checks, 60s cap, proceed-on-cap), so Phase 4b sees populated embeddings.
  • src/utils/sessionReconciliation.js — safety-net: re-run Phase 4b for built sessions that have embeddings but no source_chunk provenance, gated by a one-shot kg_phase4b_repaired marker + an embedding-service availability probe (so the marker isn't burned when the service is down).
  • src/utils/knowledgeGraph/repairSourceProvenance.js (new) — idempotent DELETE-then-rerun of Phase 4b for one session (reused by the runner + reconciliation).
  • src/utils/knowledgeGraph/kgPhases1to5.js — Phase 4b provenance write hardened: truncate source_key to varchar(200) and strip NUL (previously dropped ~6 rows/run silently).
  • src/db/postgres.js + migrations/023_kg-edges-composite-idx.js + migrations/024_sessions-kg-phase4b-repaired.{up,down}.sqlkg_edges(session_id, source_id) / (session_id, target_id) composite indexes (boot path plain; migration uses CREATE INDEX CONCURRENTLY) + the kg_phase4b_repaired column.
  • scripts/run-phase4b.mjs + scripts/backfill-source-chunk-embeddings.mjs (new) — standalone Phase 4b runner + historical-session backfill tool (for sessions predating the live fixes).

Risks

Risk Severity Mitigation
Enabling embedding adds real per-session Gemini cost + DB growth MED RAW_SOURCE_EMBEDDING=false is the instant rollback; dispatcher has built-in backpressure (maxQueueDepth=50, concurrency=2); fire-and-forget (never blocks sessions). Watch Gemini spend + source_chunk_embeddings growth post-deploy.
Stabilization poll adds ≤60s to the post-session KG build LOW Background task only (never blocks user response); hard cap; proceed-on-cap; mirrors the existing reports-poll.
Reconciliation re-runs Phase 4b forever on zero-match sessions LOW One-shot kg_phase4b_repaired marker + embedding-service probe (skip without burning marker if service down).
loadFlagsEnv changes local-dev flag resolution (enables flags.env content) LOW Intended dev=prod parity; dotenv no-overwrite preserves shell vars; boot banner makes resolved state explicit.
CREATE INDEX locks large kg_edges LOW Migration uses CONCURRENTLY + IF NOT EXISTS.

Viability / verification

  • 12/12 new unit tests pass (source-chunker-nul, validate-flags) via node --test.
  • Cherry-picked cleanly onto current main; featureFlags.js conflict resolved preserving main's OPUS_MODEL=4.8 + wrapped-subagent helpers (only validateFlags added).
  • The DB-side changes (source_chunk_embeddings backfill, kg_edges indexes, kg_phase4b_repaired column, 2,055 Cardinal chunk rows + 171 explicit provenance rows) are already applied to the shared DB and validated; this PR lands the code that produces them going forward.
  • Outstanding before full trust: Tier-3 live verification (a fresh session self-populating explicit provenance through the race fix) is tracked in Tier-3 live verification: raw-source embedding → Phase 4b explicit provenance (race fix) #188 — deferred to a real session run.

CI note

This branch does not restore .github/workflows/kg-tests.yml (that workflow was removed on main). The two new tests (source-chunker-nul.test.js, validate-flags.test.js) need to be wired into main's current test setup before merge.

Context

Extracted from the v6.14/banker-qa-phase-1 worktree where they were authored out-of-scope; relocated to this isolated branch off main so the banker PR (#178) stays scoped. Related: #188 (Tier-3 verification), #23/#41 (tool/flag history).

🤖 Generated with Claude Code

Number531 and others added 2 commits May 31, 2026 13:01
…tion, embedding/Phase-4b race

Durable backend correction so kg/raw-sources provenance populates reliably on
live sessions instead of depending on a one-off backfill + the frontend's
semantic fallback. Root causes (all confirmed): a fire-and-forget embedding
queue racing the post-session KG build, RAW_SOURCE_EMBEDDING silently inert in
dev (flags.env never loaded before the hoisted featureFlags import), and a
NUL-byte INSERT that silently dropped sources.

A. SourceChunker.chunk() strips NUL (0x00) from chunk content/header — Postgres
   rejects 0x00; 14/367 Cardinal sources were silently dropped. Mirrors the
   kgPhase4cNodeEmbeddings precedent. + source-chunker-nul.test.js.
B. loadFlagsEnv.js (first import) loads flags.env before featureFlags evaluates,
   fixing the dev hoisting gap. Boot '[Startup] Feature flags' banner +
   validateFlags() warns on RAW_SOURCE_EMBEDDING prerequisites
   (ARCHIVE/HOOK_DB/GEMINI||GOOGLE) and the archive-on/embedding-off divergence.
   + validate-flags.test.js.
C. hookDBBridge SessionEnd: stabilization poll waits for source_chunk_embeddings
   to drain (count>0 & stable, 60s cap, proceed-on-cap) before the KG build, so
   Phase 4b sees populated embeddings. Reconciliation safety-net re-runs Phase 4b
   for built-but-unlinked sessions, gated by one-shot kg_phase4b_repaired marker.
D. scripts/run-phase4b.mjs + repairSourceProvenance(): idempotent DELETE +
   re-run of Phase 4b for one session (no full-rebuild provenance duplication).
   Hardened phase4b provenance write: truncate source_key to varchar(200) and
   strip NUL (previously dropped ~6 rows/run silently).
E. kg_edges composite indexes (session_id, source_id)/(session_id, target_id)
   for the fallback neighbor CTE — boot path (plain, pre-listen) + migration 023
   (CONCURRENTLY) for live deploys.
G. Commit scripts/backfill-source-chunk-embeddings.mjs as the historical-session
   backfill tool.

Rollout: RAW_SOURCE_EMBEDDING stays ON; RAW_SOURCE_EMBEDDING=false is the instant
rollback. Route-contract hardening (linkage_mode, threshold) is a separate PR on
frontend-revamp. 424/424 KG unit tests pass; Cardinal repaired (171 source_chunk
provenance rows, idempotent re-run verified).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ase4b repair

Addresses gaps found in self-review of 684b293a:

- CRITICAL: reconciliation could burn the one-shot kg_phase4b_repaired marker when
  the embedding service was unavailable (e.g. RAW_SOURCE_EMBEDDING=true but
  EMBEDDING_PERSISTENCE=false → initEmbeddingService never ran → embedDocuments
  null → Phase 4b wrote 0 rows → marker set → never retried). Two-part fix:
  (a) broaden the boot init gate to also initialize the embedding service when
      RAW_SOURCE_EMBEDDING is on (not only EMBEDDING_PERSISTENCE);
  (b) probe the embedding service before the reconciliation repair pass and skip
      WITHOUT setting the marker when it's unavailable, so the session is retried
      once the service is healthy. The marker is only set when the service is
      confirmed live (a true zero-match terminal state).

- MEDIUM: backfill-source-chunk-embeddings.mjs had a raw NUL byte embedded in the
  stripNul regex literal (made the file non-UTF8). Replaced with the /\0/g escape.

424/424 KG unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Number531 added a commit that referenced this pull request Jun 2, 2026
Renumber 022_kg-nodes-embedding-hnsw → 025 to clear a number collision with
main's 022_artifact-source-width (8.0.x wrapped-subagents line) and #197's
reserved 023/024. Two differently-named NNN_ migrations produce NO git conflict,
so the collision is invisible to conflict review — node-pg-migrate silently
skips one on fresh/production deploys. Content is idempotent
(CREATE INDEX IF NOT EXISTS) so the renumber is data-safe.

Add scripts/check-migration-collisions.mjs + .github/workflows/migration-lint.yml:
a CI guard that fails when two migrations share a numeric prefix, running against
the PR merge-result so a feature branch colliding with main is caught before merge.
This is the SECOND occurrence of the class on this branch (prior: 011→022 rename),
so the systemic guard converts an invisible production-migration-skip into a loud
red check for all future cross-branch merges.

See docs/pending-updates/Banker-Merge-Risk.md §3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant